The ISI/USC MT system

Authors

Ignacio Thayer

Emil Ettelaie

Kevin Knight

Daniel Marcu

Dragos Stefan Munteanu

Franz Josef Och

Quamrul Tipu

Abstract

The ISI/USC machine translation system is a statistical system based on a phrase translation model that is trained on bilingual parallel data. This translation model is combined with several other knowledge sources in a log-linear manner. The weights of the individual components in the log-linear model are set by an automatic parameter-tuning method. The system described here was developed for translating news text and is a simplified version of the system we entered in the NIST 2004 MT evaluation. We give a brief overview of the components of the system and discuss its performance at IWSLT.

1. The ISI/USC Machine Translation System

Our machine translation system uses a log-linear model to combine several different knowledge sources into a direct model of translation. The 12 different models used to score hypothesized translations are given in Table 1. We also give more in-depth descriptions of the major components.

1.1. Translation Model

At the core of the system is the alignment template translation model, which learns many-to-many mappings between word sequences from parallel bilingual data. A sentence is translated by segmenting the source-language sentence into phrases, translating these phrases with the ones observed in the training data, and reordering the target-language phrases. More details about the alignment template approach to machine translation used here are given in [1], [2].

For the IWSLT evaluation for Chinese- and Japanese-to-English, we trained the alignment template system on the 20,000 lines of bilingual basic travel expressions provided by the organizers. For the “additional” evaluation condition for Chinese, we used 6 of the allowed corpora provided by LDC. For the “unrestricted” evaluation condition for Chinese, we used 167M words of parallel news and political data obtained from LDC in addition to the provided data.
When mixing the provided in-domain data with out-of-domain data, the in-domain data was weighted by a factor of 5 and was resegmented with the LDC segmenter.

1 Now at Google, Inc.

1.2. Language Model

A smoothed trigram language model was also used to score hypothesized translations. We used the SRI Language Modeling Toolkit to train a language model smoothed with Kneser-Ney discounting. For all of the evaluation conditions, a language model was trained on the English half of the parallel corpus used for alignment-template training. For the “additional” and “unrestricted” evaluation conditions, an additional language model was used that was trained on 800M words of monolingual news text. Each language model is considered an independent information source and is weighted separately in the global log-linear model.

1.3. Minimum Error Rate Training

The individual model weights of the log-linear model are set using a parameter tuning procedure that minimizes the error rate of a given evaluation function (such as the BLEU score) on a held-out test corpus. Setting model weights to minimize the error of the function used for testing has been shown to provide better results than maximum-likelihood training [3]. For this evaluation, we optimized the parameters to achieve the best performance with respect to the BLEU score. We split the provided development data into two equally sized corpora that were used separately for minimum error rate training and testing.
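The interaction between the log-linear model and weight tuning can be illustrated with a small sketch. This is not the system's implementation: it assumes each hypothesis in an n-best list carries a vector of feature values (for example, log-probabilities from the translation model and a language model) and a precomputed per-sentence error count, and it replaces the optimization procedure of [3] with a plain exhaustive grid search. BLEU is a corpus-level score that does not decompose into per-sentence errors, so the additive error used here is a simplification. All names (`loglinear_score`, `tune_weights`, the toy data) are hypothetical.

```python
import itertools
from typing import List, Sequence, Tuple

# One hypothesis: (feature values, error count against the reference).
# Feature values stand in for log-domain model scores; both are toy numbers.
Hypothesis = Tuple[Sequence[float], float]


def loglinear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Log-linear model score: weighted sum of the feature values."""
    return sum(w * h for w, h in zip(weights, features))


def best_hypothesis(weights: Sequence[float], nbest: List[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest log-linear score."""
    return max(nbest, key=lambda hyp: loglinear_score(weights, hyp[0]))


def corpus_error(weights: Sequence[float], nbest_lists: List[List[Hypothesis]]) -> float:
    """Sum the errors of the top-scoring hypothesis of every sentence."""
    return sum(best_hypothesis(weights, nbest)[1] for nbest in nbest_lists)


def tune_weights(nbest_lists: List[List[Hypothesis]],
                 grid: Sequence[float],
                 n_features: int) -> Tuple[float, ...]:
    """Grid-search stand-in for error-rate training (assumption, not the method of [3])."""
    candidates = itertools.product(grid, repeat=n_features)
    return min(candidates, key=lambda w: corpus_error(w, nbest_lists))


# Toy development set: two sentences, two hypotheses each,
# features = (translation model score, language model score).
dev_nbest = [
    [((-1.0, -4.0), 1.0), ((-2.0, -1.0), 0.0)],
    [((-0.5, -3.0), 1.0), ((-1.5, -1.0), 0.0)],
]

tuned = tune_weights(dev_nbest, grid=[0.0, 0.5, 1.0], n_features=2)
print("tuned weights:", tuned)
print("error with tuned weights:", corpus_error(tuned, dev_nbest))
print("error with translation model only:", corpus_error((1.0, 0.0), dev_nbest))
```

In this toy setting, relying on the translation model feature alone selects the worse hypothesis for both sentences, while any weight setting that gives the language model feature enough influence selects the better ones. A grid search scales poorly with the number of features, which is one reason a dedicated optimization procedure is preferable for a model with 12 components.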

Similar resources

MT at the Paragraph Level: Improving English Synthesis in SYSTRAN

In Machine Translation (MT), output quality can be seriously affected by multi-sentence and multi-clause phenomena such as pronominalization, multisentence quotation, comma placement, etc. Yet almost all MT systems operate sentence-by-sentence. This paper describes the transfer of some research on paragraph structure from the research laboratory (USC/ISI) to commercial practice (SYSTRAN). It ou...

Integrating Knowledge Bases and Statistics in MT

2 System Design: Philosophy We summarize recent machine translation (MT) research at the Information Sciences Institute of USC, and we describe its application to the development of a Japanese-English newspaper MT system. Our work aims at scaling up grammar-based, knowledge-based MT techniques. This scale-up involves the use of statistical methods, both in acquiring effective knowledge resources...

File: draft-ietf-rsvp-md5-07.txt Bob Lindell USC/ISI Mohit Talwar USC/ISI

Cisco File: draft-ietf-rsvp-md5-07.txt Bob Lindell USC/ISI Mohit Talwar USC/ISI RSVP Cryptographic Authentication Status of this Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft ...

Profile: USC/ISI Polymorphic Robotics Laboratory 1 USC/ISI Polymorphic Robotics Laboratory

Multi-database mining is an important research area because (1) there is an urgent need for analyzing data in different sources, (2) there are essential differences between mono- and multi-database mining, and (3) there are limitations in existing multi-database mining efforts. This paper designs a new multi-database mining process. Some research issues involving mining multi-databases, including ...

Generation From Lexical Conceptual Structures

This paper describes a system for generating natural language sentences from an interlingual representation, Lexical Conceptual Structure (LCS). This system has been developed as part of a Chinese-English Machine Translation system; however, it promises to be useful for many other MT language pairs. The generation system has also been used in Cross-Language information retrieval research (Levow ...

Publication date: 2004
